Natasha 2: Faster Non-Convex Optimization Than SGD
Author
Abstract
We design a stochastic algorithm to train any smooth neural network to ε-approximate local minima, using O(ε^{-3.25}) backpropagations. The previously best result was essentially O(ε^{-4}), by SGD. More broadly, the algorithm finds ε-approximate local minima of any smooth nonconvex function at rate O(ε^{-3.25}), with only oracle access to stochastic gradients.

∗ V1 appeared on arXiv on this date; V2 and V3 polished the writing. This paper builds on, but should not be confused with, the offline method Natasha1 [3], which only finds approximate stationary points. When this manuscript first appeared online, the best rate was indeed T = O(ε^{-4}), by SGD. Several follow-up works appeared after this paper while citing it, including stochastic cubic regularization [47], which gives T = O(ε^{-3.5}) (Nov 2017), and Neon+SCSG [10, 49], which gives T = O(ε^{-3.333}) (Nov 2017); these rates are worse than T = O(ε^{-3.25}). Our original method also requires oracle access to Hessian-vector products, but the follow-up paper of Allen-Zhu and Li [10] lets us replace the Hessian-vector products with stochastic gradient computations; V3 of this manuscript reflects this change.

arXiv:1708.08694v3 [math.OC], 23 Feb 2018
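As a reading aid (this is the standard convention in this literature, not a quotation from the paper, and the exact Hessian tolerance δ varies across works): a point x is an ε-approximate local minimum of a smooth f when

\|\nabla f(x)\| \le \varepsilon \qquad \text{and} \qquad \nabla^2 f(x) \succeq -\delta\, I,

i.e. the gradient is small and the Hessian has no strongly negative eigenvalue; δ is typically a power of ε such as \sqrt{\varepsilon} or \varepsilon^{1/4}.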
Similar Resources
Stochastic Variance Reduction for Nonconvex Optimization
We study nonconvex finite-sum problems and analyze stochastic variance reduced gradient (Svrg) methods for them. Svrg and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent (Sgd); but their theoretical analysis almost exclusively assumes convexity. In contrast, we prove non-asymptotic rates of convergence (to stationary...
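The core of SVRG-type methods for such finite-sum problems f(x) = (1/n) Σ_i f_i(x) is a variance-reduced gradient estimator built around a periodic snapshot. Below is a minimal Python sketch of that estimator, assuming a user-supplied grad_i(x, i) for the component gradients; it illustrates the idea only and is not the authors' implementation or tuning.

import numpy as np

def svrg(x0, grad_i, n, lr=0.01, epochs=10, inner_steps=100, rng=None):
    # Minimal SVRG sketch for f(x) = (1/n) * sum_i f_i(x).
    # grad_i(x, i) must return the gradient of the i-th component at x.
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        snapshot = x.copy()
        # Full gradient at the snapshot, computed once per epoch.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced estimator: unbiased for the full gradient,
            # with variance that shrinks as x approaches the snapshot.
            v = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * v
    return x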
VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning
In this paper, we propose a simple variant of the original SVRG, called variance reduced stochastic gradient descent (VR-SGD). Unlike the choices of snapshot and starting points in SVRG and its proximal variant, Prox-SVRG, the two vectors of VR-SGD are set to the average and last iterate of the previous epoch, respectively. The settings allow us to use much larger learning rates, and also make ...
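The distinguishing choice described above can be made concrete with a small variation of the same loop: the snapshot for the next epoch is the average of the current epoch's iterates, while the next epoch starts from the last iterate. The sketch below is a plausible rendering of that description, not the authors' reference code.

import numpy as np

def vr_sgd(x0, grad_i, n, lr=0.1, epochs=10, inner_steps=100, rng=None):
    # Sketch of the VR-SGD choices described above (illustrative only):
    # snapshot <- average of the previous epoch's iterates,
    # starting point <- last iterate of the previous epoch.
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    snapshot = x.copy()
    for _ in range(epochs):
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        iterates = []
        for _ in range(inner_steps):
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * v
            iterates.append(x.copy())
        snapshot = np.mean(iterates, axis=0)  # average of this epoch's iterates
        # x itself (the last iterate) carries over as the next starting point.
    return x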
signSGD: compressed optimisation for non-convex problems
Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. SIGNSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. SIGNSGD can exploit mismatches ...
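The one-bit-per-coordinate idea is easy to state in code: each worker sends only the sign of its minibatch gradient and the server combines the signs, e.g. by elementwise majority vote. The sketch below illustrates a single parameter update under that scheme; the function name and the aggregation step are illustrative assumptions, not the paper's reference implementation.

import numpy as np

def signsgd_step(x, worker_grads, lr=1e-3):
    # worker_grads: array of shape (num_workers, dim), one minibatch
    # stochastic gradient per worker. Only sign(g) would be transmitted,
    # i.e. one bit per coordinate per worker.
    signs = np.sign(worker_grads)
    vote = np.sign(signs.sum(axis=0))   # server: elementwise majority vote
    return x - lr * vote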
Larger is Better: The Effect of Learning Rates Enjoyed by Stochastic Optimization with Progressive Variance Reduction
In this paper, we propose a simple variant of the original stochastic variance reduction gradient (SVRG) [1], which we hereafter refer to as the variance reduced stochastic gradient descent (VR-SGD). Different from the choices of the snapshot point and starting point in SVRG and its proximal variant, Prox-SVRG [2], the two vectors of each epoch in VR-SGD are set to the average and last iterate o...
Annealed Gradient Descent for Deep Learning
Stochastic gradient descent (SGD) has been regarded as a successful optimization algorithm in machine learning. In this paper, we propose a novel annealed gradient descent (AGD) method for non-convex optimization in deep learning. AGD optimizes a sequence of gradually improved smoother mosaic functions that approximate the original non-convex objective function according to an annealing schedul...
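The annealing idea (optimize a sequence of gradually less-smoothed surrogates of the objective) can be sketched generically. Below, Gaussian randomized smoothing stands in for the paper's mosaic approximations, which are constructed differently; treat this as an assumption-laden illustration of the schedule, not the AGD algorithm itself.

import numpy as np

def annealed_descent(x0, grad, sigmas=(1.0, 0.3, 0.1, 0.0),
                     lr=0.05, steps_per_level=200, samples=8, rng=None):
    # Generic annealed descent: follow gradients of progressively
    # less-smoothed approximations of the objective, ending on the
    # original (unsmoothed) function.
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for sigma in sigmas:                     # annealing schedule
        for _ in range(steps_per_level):
            if sigma > 0:
                # Monte Carlo gradient of the smoothed objective
                # E_u[ f(x + sigma * u) ], u ~ N(0, I).
                g = np.mean([grad(x + sigma * rng.standard_normal(x.shape))
                             for _ in range(samples)], axis=0)
            else:
                g = grad(x)                  # final level: original objective
            x -= lr * g
    return x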
Journal title: CoRR
Volume: abs/1708.08694
Pages: -
Publication date: 2017